http://genomicsclass.github.io/book/pages/rnaseq_gene_level.html http://www.bioconductor.org/help/workflows/rnaseqGene

Introduction to RNA-seq

RNA-seq is a technology that uses next-generation/high-throughput sequencing (NGS/HTS) to reveal a snapshot of the kind and amount of RNA molecules in a biological sample at a given moment in time.

Transcript quantification based on RNA-seq counts is unstable due to ambiguity and bias (not-uniform coverage along the length of the transcript due to technical bias).

Here, we will focus on comparing the expression levels of genes across different samples, by counting the number of reads which overlap the exons of genes defined by a known annotation of the reference genome.


The FPKM (fragments per kilobase of sequence per million mapped reads) is a normalized measure of expression level (having divided out the transcript length and the number of mapped reads).

TPM (transcripts per million), is a linear scaling of the FPKM, such that we would expect a gene with 1 TPM to have one molecule in a population of one million mRNA.

TPM is equal to: (FPKM / sum(FPKM)) * 1 million

CPM (counts per million)

RNA-seq data sources

Quality control and pre-processing of RNA-seq reads using FastQC and other tools

Alignment of RNA-seq reads to a reference genome/transcriptome

Aligners generate SAM/BAM files containing the reads from each biological sample aligned to a reference genome from:

  1. A FASTQ file containing the unaligned reads
  2. A FASTA file containing the reference genome
  3. A GTF file containing the annotation (gene models) for the reference genome.

See also: Illumina iGenomes

Alignment of RNA-seq reads to a reference genome using the STAR, TopHat or RSEM aligner

Counting RNA-seq reads in genes/exons

A count matrix is a summarized version of an RNA-seq experiment which has genes along the rows and biological samples along the columns. The values in the matrix are the number of RNA-seq reads which could be uniquely aligned to the exons of a given gene for a given sample.

First, we define variables pointing to 1) a sample table (containing the metadata for each sample), 2) the BAM files (containing the reads from each biological sample aligned to a reference genome), and 3) a GTF file (containing the annotation for the reference genome that was used to generate the BAM files). Use the sample table to contruct the BAM files vector, so that the count matrix will be in the same order as the sample table.

Himes2014 describes the data from the airway package that are used in the following examples.

library(airway)
dir <- system.file("extdata", package = "airway", mustWork = TRUE)
csv_file <- file.path(dir, "sample_table.csv")
sample_table <- read.csv(csv_file, row.names = 1)
print(sample_table)
##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample
## SRR1039508 SRS508568 SAMN02422669
## SRR1039509 SRS508567 SAMN02422675
## SRR1039512 SRS508571 SAMN02422678
## SRR1039513 SRS508572 SAMN02422670
## SRR1039516 SRS508575 SAMN02422682
## SRR1039517 SRS508576 SAMN02422673
## SRR1039520 SRS508579 SAMN02422683
## SRR1039521 SRS508580 SAMN02422677
bam_files <- file.path(dir, paste0(sample_table$Run, "_subset.bam"))
gtf_file <- file.path(dir, "Homo_sapiens.GRCh37.75_subset.gtf")

Next, we make a BamFileList object bam_file_list which wraps the BAM files. We can ignore the warning about matchCircularity.

library(Rsamtools)
## Loading required package: XVector
## Loading required package: Biostrings
bam_files_list <- BamFileList(bam_files) # see also yieldSize option if working in an environment with a limited amount of memory

Finally, we make a GRangesList object exons_by_gene which contains the exons for each gene.

library(GenomicFeatures)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
txdb <- makeTxDbFromGFF(gtf_file, format = "gtf")
## Import genomic features from the file as a GRanges object ... OK
## Prepare the 'metadata' data frame ... OK
## Make the TxDb object ... OK
exons_by_gene <- exonsBy(txdb, by = "gene")

Next, we create a SummarizedExperiment object se which contains the counts for the reads in each BAM file in bam_files_list (columns) aligned to each gene in exons_by_gene (rows). We add the sample_table dataframe as column data. The column order is correct, because the bam_files_list was constructed from the sample_table.

library(GenomicAlignments)
## 
## Attaching package: 'GenomicAlignments'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
se <- summarizeOverlaps(exons_by_gene, bam_files_list,
                        mode = "Union",
                        singleEnd = FALSE,
                        fragments = TRUE,
                        ignore.strand = TRUE)

head(colData(se))
## DataFrame with 6 rows and 0 columns
colData(se) <- DataFrame(sample_table)
head(colData(se))
## DataFrame with 6 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
rowRanges(se) # same as rowData()
## GRangesList object of length 20:
## $ENSG00000009724 
## GRanges object with 18 ranges and 2 metadata columns:
##        seqnames               ranges strand   |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
##    [1]        1 [11086580, 11087705]      -   |        98 ENSE00000818830
##    [2]        1 [11090233, 11090307]      -   |        99 ENSE00000472123
##    [3]        1 [11090805, 11090939]      -   |       100 ENSE00000743084
##    [4]        1 [11094885, 11094963]      -   |       101 ENSE00000743085
##    [5]        1 [11097750, 11097868]      -   |       103 ENSE00003520086
##    ...      ...                  ...    ... ...       ...             ...
##   [14]        1 [11106948, 11107176]      -   |       111 ENSE00003467404
##   [15]        1 [11106948, 11107176]      -   |       112 ENSE00003489217
##   [16]        1 [11107260, 11107280]      -   |       113 ENSE00001833377
##   [17]        1 [11107260, 11107284]      -   |       114 ENSE00001472289
##   [18]        1 [11107260, 11107290]      -   |       115 ENSE00001881401
## 
## ...
## <19 more elements>
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(assay(se)) # assay(se) is the count matrix
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000009724 38 28 66 24 42 41 47 36
ENSG00000116649 1004 1255 1122 1313 1100 1879 745 1536
ENSG00000120942 218 256 233 252 269 465 207 400
ENSG00000120948 2751 2080 3353 1614 3519 3716 2220 1990
ENSG00000171819 4 50 19 543 1 10 14 1067
ENSG00000171824 869 1075 1115 1051 944 1405 748 1590

We can use the featureCounts function in the Rsubread package as an alternative to the summarizeOverlaps function in the GenomicAlignments package:

library(Rsubread)
fc <- featureCounts(bam_files, annot.ext = gtf_file, isGTFAnnotationFile = TRUE, isPairedEnd = TRUE)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.18.0
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 8 BAM files                                      ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                           P /Users/marcoe02/Library/R/3.2/library/airw ... ||
## ||                                                                            ||
## ||             Output file : ./.Rsubread_featureCounts_pid10470               ||
## ||             Annotations : /Users/marcoe02/Library/R/3.2/library/airway ... ||
## ||                                                                            ||
## ||                 Threads : 1                                                ||
## ||                   Level : meta-feature level                               ||
## ||              Paired-end : yes                                              ||
## ||         Strand specific : no                                               ||
## ||      Multimapping reads : not counted                                      ||
## || Multi-overlapping reads : not counted                                      ||
## ||                                                                            ||
## ||          Chimeric reads : counted                                          ||
## ||        Both ends mapped : not required                                     ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file /Users/marcoe02/Library/R/3.2/library/airway/extd ... ||
## ||    Features : 406                                                          ||
## ||    Meta-features : 20                                                      ||
## ||    Chromosomes : 1                                                         ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7142                                                  ||
## ||    Successfully assigned fragments : 6649 (93.1%)                          ||
## ||    Running time : 0.14 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    1 read has missing mates.                                               ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7200                                                  ||
## ||    Successfully assigned fragments : 6712 (93.2%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 8536                                                  ||
## ||    Successfully assigned fragments : 7910 (92.7%)                          ||
## ||    Running time : 0.15 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    1 read has missing mates.                                               ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 7544                                                  ||
## ||    Successfully assigned fragments : 7044 (93.4%)                          ||
## ||    Running time : 0.14 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    2 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 8818                                                  ||
## ||    Successfully assigned fragments : 8261 (93.7%)                          ||
## ||    Running time : 0.15 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    6 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 11850                                                 ||
## ||    Successfully assigned fragments : 11148 (94.1%)                         ||
## ||    Running time : 0.15 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    6 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 5877                                                  ||
## ||    Successfully assigned fragments : 5415 (92.1%)                          ||
## ||    Running time : 0.13 minutes                                             ||
## ||                                                                            ||
## || Process BAM file /Users/marcoe02/Library/R/3.2/library/airway/extdata/ ... ||
## ||    Paired-end reads are included.                                          ||
## ||    Assign fragments (read pairs) to features...                            ||
## ||    Found reads that are not properly paired.                               ||
## ||    (missing mate or the mate is not the next read)                         ||
## ||    3 reads have missing mates.                                             ||
## ||    Input was converted to a format accepted by featureCounts.              ||
## ||    Total fragments : 10208                                                 ||
## ||    Successfully assigned fragments : 9538 (93.4%)                          ||
## ||    Running time : 0.15 minutes                                             ||
## ||                                                                            ||
## ||                         Read assignment finished.                          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
names(fc)
## [1] "counts"     "annotation" "targets"    "stat"
head(fc$counts) # fc$counts is the count matrix
X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039508_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039509_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039512_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039513_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039516_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039517_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039520_subset.bam X.Users.marcoe02.Library.R.3.2.library.airway.extdata.SRR1039521_subset.bam
ENSG00000175262 0 0 4 1 0 0 1 0
ENSG00000215785 0 0 1 0 0 0 0 0
ENSG00000264181 0 0 0 0 0 0 0 0
ENSG00000207213 0 0 0 0 0 0 0 0
ENSG00000120948 2673 2031 3263 1570 3446 3615 2171 1949
ENSG00000009724 38 28 66 24 42 41 47 36

Normalization, transformation and exploratory data analysis of RNA-seq counts

We now load the full SummarizedExperiment object, counting reads over all the genes.

library(airway)
data(airway)
print(airway)
## class: SummarizedExperiment 
## dim: 64102 8 
## exptData(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowRanges metadata column names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(airway)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength
##              <factor> <factor> <factor> <factor>   <factor> <integer>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98
##            Experiment    Sample    BioSample
##              <factor>  <factor>     <factor>
## SRR1039508  SRX384345 SRS508568 SAMN02422669
## SRR1039509  SRX384346 SRS508567 SAMN02422675
## SRR1039512  SRX384349 SRS508571 SAMN02422678
## SRR1039513  SRX384350 SRS508572 SAMN02422670
## SRR1039516  SRX384353 SRS508575 SAMN02422682
## SRR1039517  SRX384354 SRS508576 SAMN02422673
## SRR1039520  SRX384357 SRS508579 SAMN02422683
## SRR1039521  SRX384358 SRS508580 SAMN02422677
rowRanges(airway)
## GRangesList object of length 64102:
## $ENSG00000000003 
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames               ranges strand   |   exon_id       exon_name
##           <Rle>            <IRanges>  <Rle>   | <integer>     <character>
##    [1]        X [99883667, 99884983]      -   |    667145 ENSE00001459322
##    [2]        X [99885756, 99885863]      -   |    667146 ENSE00000868868
##    [3]        X [99887482, 99887565]      -   |    667147 ENSE00000401072
##    [4]        X [99887538, 99887565]      -   |    667148 ENSE00001849132
##    [5]        X [99888402, 99888536]      -   |    667149 ENSE00003554016
##    ...      ...                  ...    ... ...       ...             ...
##   [13]        X [99890555, 99890743]      -   |    667156 ENSE00003512331
##   [14]        X [99891188, 99891686]      -   |    667158 ENSE00001886883
##   [15]        X [99891605, 99891803]      -   |    667159 ENSE00001855382
##   [16]        X [99891790, 99892101]      -   |    667160 ENSE00001863395
##   [17]        X [99894942, 99894988]      -   |    667161 ENSE00001828996
## 
## ...
## <64101 more elements>
## -------
## seqinfo: 722 sequences (1 circular) from an unspecified genome
head(assay(airway))
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 679 448 873 408 1138 1047 770 572
ENSG00000000005 0 0 0 0 0 0 0 0
ENSG00000000419 467 515 621 365 587 799 417 508
ENSG00000000457 260 211 263 164 245 331 233 229
ENSG00000000460 60 55 40 35 78 63 76 60
ENSG00000000938 0 0 2 0 1 0 0 0

First, note that different samples have been sequenced at different depths and we need to normalize for this in order to more easily compare samples.

sort(round(colSums(assay(airway))/1e6, 2))
## SRR1039513 SRR1039509 SRR1039520 SRR1039508 SRR1039521 SRR1039516 
##      15.16      18.81      19.13      20.64      21.16      24.45 
## SRR1039512 SRR1039517 
##      25.35      30.82

In addition, with un-transformed gene counts, the high count genes have high variance. That is, in the following scatter plot, the points start out in a tight cone and then fan out toward the top right. This is a general property of counts generated from sampling processes, that the variance typically increases with the expected value. If these genes were included in a distance calculation (e.g., for sample clustering), the high variance at the high count range could overwhelm the signal at the lower count range. We will explore different transformations below to correct for this effect (e.g., the log2 and rlog transformations).

plot(assay(airway)[, 1:2], cex = .1, main = "untransf, not norm")

Normalization for sequencing depth

We use the DESeq2 package to normalize for sequencing depth. The DESeqDataSet object is an extension of the SummarizedExperiment object, with a few changes. The matrix in assay is now accessed with counts and the elements of this matrix are required to be non-negative integers (0,1,2,…).

We specify an experimental design here, for later use, although for estimating size factors, we could just use ~ 1 as a default design. The variables are columns of the colData, and the + indicates that for differential expression analysis we want to compare levels of dex while controlling for the cell differences.

library(DESeq2)
## Loading required package: Rcpp
## Loading required package: RcppArmadillo
dds <- DESeqDataSet(airway, design = ~ cell + dex)

A DESeqDataSet object can also be constructed using a count matrix and column/sample data!

dds2 <- DESeqDataSetFromMatrix(fc$counts, colData = sample_table, design = ~ cell + dex)

See also: DESeqDataSetFromHTSeqCount()

Please not that the last variable in the design (in this case the dex variable/treatment) is used by default for building the results table!

Estimate the size factors to account for differences in sequencing depth. The size factors are similar to colum sums, but are more robust.

dds <- estimateSizeFactors(dds)
sizeFactors(dds)
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 
##  1.0236476  0.8961667  1.1794861  0.6700538  1.1776714  1.3990365 
## SRR1039520 SRR1039521 
##  0.9207787  0.9445141
colSums(counts(dds))
## SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 
##   20637971   18809481   25348649   15163415   24448408   30818215 
## SRR1039520 SRR1039521 
##   19126151   21164133
plot(sizeFactors(dds), colSums(counts(dds)), cex = .1) 
abline(lm(colSums(counts(dds)) ~ sizeFactors(dds) + 0))

Size factors are calculated as the median ratio of each sample to a pseudo-sample (i.e., the geometric mean of all samples). In other words, for each sample, we take the exponent of the median of the log ratios (of each sample to a pseudo-sample). For example, the size factor for the first sample:

n.b.: log(a/b) = log(a) - log(b)

loggeomeans <- rowMeans(log(counts(dds)))
exp(median((log(counts(dds)[, 1]) - loggeomeans)[is.finite(loggeomeans)]))
## [1] 1.023648
sizeFactors(dds)[1]
## SRR1039508 
##   1.023648

Examine the log2-transformed counts and the log2-transformed and depth-normalized counts (plus a pseudocount), excluding genes with zero counts across all samples.

rs = rowSums(counts(dds))
mypar(1,2)
boxplot(log2(counts(dds)[rs > 0, ] + 1), main = "log2-trans, not norm")
boxplot(log2(counts(dds, normalized = TRUE)[rs > 0, ] + 1), main = "log2-trans, depth-norm")

Make a matrix of log2 normalized counts (plus a pseudocount):

log2.norm.counts = log2(counts(dds, normalized = TRUE) + 1)

Transformations for variance stabilization

Make a scatterplot of the log2 normalized counts (plus a pseudocount) against each other. Note the fanning out of the points in the lower left corner (the opposite of the effect observed when using the untransformed counts).

plot(log2.norm.counts[, 1:2], cex = .1, main = "log2-trans, depth-norm")

Now we will use a more sophisticated transformation (regularized logarithm or rlog), which uses the variance model for count data to shrink together the log-transformed counts for genes with very low counts. For genes with medium and high counts, the rlog is very close to log2. Another transformation for stabilizing variance in the DESeq2 package is the appropriately named varianceStabilizingTransformation. These two tranformations are similar, although the rlog might perform better when the size factors vary widely.

rlog.dds <- rlog(dds)
rlog.norm.counts <- assay(rlog.dds)
plot(rlog.norm.counts[, 1:2], cex = .1, main = "rlog-trans, depth-norm")

We can examine the standard deviation of rows over the mean for the log2-transformed and depth-normalized counts (plus pseudocount) and the rlog transformation.

library(vsn)
mypar(1, 2)
meanSdPlot(log2.norm.counts, ranks = FALSE, ylim = c(0, 3), main = "log2")
meanSdPlot(rlog.norm.counts, ranks = FALSE, ylim = c(0, 3), main = "rlog")

Note that the genes with high variance for the log2 normalized counts (plus a pseudocount) transformation come from the genes with lowest mean. If these genes were included in a distance calculation (e.g., for sample clustering), the high variance at the low count range could overwhelm the signal at the higher count range.

Exploratory data analysis (EDA) of RNA-seq counts

With normalized and variance-stabilized gene counts across samples, we can use principal components analysis (PCA) as a useful diagnostics for examining relationships between samples.

Using the log2 normalized counts (plus a pseudocount) transformation:

# To speed up the PCA, include only the 500 genes with the most variance across samples
rowvar <- apply(log2.norm.counts, 1, var)
topgenes_idx <- head(order(rowvar, decreasing = TRUE), 500)
pc <- prcomp(t(log2.norm.counts[topgenes_idx, ])) # transpose because prcomp expects samples to be in rows
mypar()
plot(pc$x[,1], pc$x[,2], col = colData(dds)$dex, pch = as.integer(colData(dds)$cell))

Using the rlog transformation (and the plotPCA function in DESeq2 instead of the standard plot function):

plotPCA(rlog.dds, intgroup = "dex")

plotPCA(rlog.dds, intgroup = c("dex", "cell"))

We can make this plot even nicer using the ggplot2 library to customize the output of the plotPCA function.

library(ggplot2)
(data <- plotPCA(rlog.dds, intgroup = c("dex", "cell"), returnData = TRUE))
PC1 PC2 group dex cell name
SRR1039508 -14.331359 -4.208796 untrt : N61311 untrt N61311 SRR1039508
SRR1039509 6.754169 -2.245244 trt : N61311 trt N61311 SRR1039509
SRR1039512 -8.130393 -3.952904 untrt : N052611 untrt N052611 SRR1039512
SRR1039513 14.505648 -2.941863 trt : N052611 trt N052611 SRR1039513
SRR1039516 -11.891409 13.735002 untrt : N080611 untrt N080611 SRR1039516
SRR1039517 8.373975 17.823844 trt : N080611 trt N080611 SRR1039517
SRR1039520 -9.965898 -10.014674 untrt : N061011 untrt N061011 SRR1039520
SRR1039521 14.685269 -8.195366 trt : N061011 trt N061011 SRR1039521
(percentVar <- 100 * round(attr(data, "percentVar"), 2)) 
## [1] 39 27
makeLab <- function(x, pc) paste0("PC", pc, ": ", x, "% variance")
ggplot(data, aes(PC1, PC2, col = dex, shape = cell)) + geom_point() + xlab(makeLab(percentVar[1], 1)) + ylab(makeLab(percentVar[2], 2)) + theme_bw()

In addition, we can plot a hierarchical clustering tree based on a Euclidean distance matrix between samples.

mypar(1,2)
plot(hclust(dist(t(log2.norm.counts))), labels = colData(dds)$dex, main = "log2")
plot(hclust(dist(t(rlog.norm.counts))), labels = colData(rlog.dds)$dex, main = "rlog")

Modeling RNA-seq counts for differential expression analysis (DEA)

In the absence of biological and technical variability, RNA-seq counts behave much like Poisson-distributed random variables, with the variability coming only from random sampling.

Often, we are interested in the log ratios of counts (one ratio for each gene/exon/transcript/feature of interest) to compare two samples or conditions. Beware that the log ratios of counts of two Poisson-distributed random variables have higher variance at low lambdas (i.e., transcripts with low abundance).

log2 ratio of Poisson random variables

log2 ratio of real RNA-seq data

In reality (i.e., in the presence of biological and technical variability) the behavior of RNA-seq counts can be better approximated by negative binomial (NB)-distributed random variables (for which the mean exceeds the variance), with the variability coming from random sampling and additional sources causing over-dispersion with respect to the Poisson distribution (for which the mean equals the variance).

mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n, lambda = 100), main = "Poisson/NB, disp = 0", xlab = "", breaks = brks, col = "black") 
hist(rnbinom(n, mu = 100, size = 1/.01), main = "NB, disp = 0.01", xlab = "", breaks = brks, col = "black")
hist(rnbinom(n, mu = 100, size = 1/.1), main = "NB, disp = 0.1", xlab = "", breaks = brks, col = "black")

The sqrt of the dispersion coefficient in the NB distribution is the coefficient of variation (i.e., stdev/mu) after subtracting the variance due to random sampling (i.e., the coefficient of variation of technical+biological variability), which is a measure of over-dispersion with respect to the Poisson distribution.

Why modeling raw RNA-seq counts instead of normalized RNA-seq counts for DEA?

Raw RNA-seq counts:

\[K_{ij} \sim \text{NB}(\mu_{ij} = s_{ij} q_{ij} )\]

Normalized RNA-seq counts:

\[\frac{K_{ij}}{s_{ij}} \sim \mathcal{L}(\mu_{ij} = q_{ij})\]

When normalizing, i.e., scaling up or down, Poisson-distributed random variables/the raw RNA-seq counts, we break the implicit link between the mean and the variance.

mypar(3,1)
n <- 10000
brks <- 0:400
hist(rpois(n,100), main = "", xlab = "", breaks=brks, col = "black")
hist(rpois(n,1000)/10, main = "", xlab = "", breaks=brks, col = "black")
hist(rpois(n,10)*10, main = "", xlab = "", breaks=brks, col = "black")

This becomes a problem when we have a small number of biological replicates and thus we cannot robustly estimate the within-group variance. Thus, we can get a benefit of information from using the raw RNA-seq counts, and incorporating normalization factors on the right side of the equation above instead.

Differential expression analysis at the gene level

We are interested in the effect of the dex treatment expressed as log2 ratio of treated over untreated/control and thus we need to make sure that the untreated/control level of the dex variable/treatment is the first level.

levels(dds$dex)
## [1] "trt"   "untrt"
dds$dex <- relevel(dds$dex, "untrt")
levels(dds$dex)
## [1] "untrt" "trt"

We then run the DESeq2 model and inspect the results table (built according to the last variable in the design, in this case the dex variable/treatment).

dds <- DESeq(dds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
res <- results(dds)
head(res)
## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 6 columns
##                    baseMean log2FoldChange      lfcSE       stat
##                   <numeric>      <numeric>  <numeric>  <numeric>
## ENSG00000000003 708.6021697    -0.37424998 0.09873107 -3.7906000
## ENSG00000000005   0.0000000             NA         NA         NA
## ENSG00000000419 520.2979006     0.20215551 0.10929899  1.8495642
## ENSG00000000457 237.1630368     0.03624826 0.13684258  0.2648902
## ENSG00000000460  57.9326331    -0.08523371 0.24654402 -0.3457140
## ENSG00000000938   0.3180984    -0.11555969 0.14630532 -0.7898530
##                       pvalue        padj
##                    <numeric>   <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005           NA          NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ENSG00000000938 0.4296136373          NA
# padj is calculated using BH -> FDR
table(res$padj < 0.1)
FALSE TRUE
12644 4897
summary(res)
## 
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)     : 2646, 7.9% 
## LFC < 0 (down)   : 2251, 6.7% 
## outliers [1]     : 0, 0% 
## low counts [2]   : 15928, 48% 
## (mean count < 5.3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Please note that the default behavior is to calculate the adjusted pval using BH and to filter the results using a FDR of 10%. However, it is possible to change the FDR by specifying the alpha' argument of theresults` function.

We can visualize the results using an MA-plot, where the y-axis M is the log2 fold change, the x-axis A is the mean raw count across all samples.

mypar()
plotMA(res, ylim = c(-4, 4))

Please note that the default statistical test looks for log2 fold changes that are different than zero (i.e., the null hypothesis of no change in gene expression), which are marked in red. However, it is possible to only look for log2 fold changes that are higher than a certain threshold by specifying the lfcThreshold' argument of theresults` function.

res2 <- results(dds, lfcThreshold = 1)
plotMA(res2, ylim = c(-4, 4))

DESeq2 automatically filters out low count genes, and chooses a threshold by optimizing the number of genes with adjusted p-value < alpha. Adjusted p-values are larger than the original p-value. The amount that the adjusted p-value is raised depends on the number of tests. If we can remove genes which have no power to find differences (genes with very small counts, where the differences are obscured by noise), we can increase power because we run fewer tests and thus have smaller adjusted p-values. A review of this filtering strategy is presented in Bourgon2010.

DESeq2 also automatically filters out (or with more samples, replaces) genes with outlier counts, to leave only genes with consistent differences across condition. Both DESeq2 filters can be customized or turned off by the arguments listed in the summary function output.


We mentioned that gene counts are proportional to gene expression, as well as sequencing depth, average transcript length and other technical bias factors. In this analysis we accounted for sequencing depth changes across samples, and we assumed that the other factors cancelled out when calculating fold changes across samples. However, it is possible to account for these differences across samples as well, by using other software to estimate sample-specific bias parameters (e.g., the CQN or EDASeq package) and provide this information to the normalizationFactors function before running the DESeq function. We could also use software like RSEM to estimate the average transcript length for each gene and sample and provide this information to the estimateSizeFactors function.


Next, we sort the results by adjusted pval in order to inspect the top DEGs…

res.sorted_by_padj <- res[order(res$padj), ]
head(res.sorted_by_padj)
## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 6 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat
##                  <numeric>      <numeric> <numeric> <numeric>
## ENSG00000152583   997.4398       4.316100 0.1724127  25.03354
## ENSG00000165995   495.0929       3.188698 0.1277441  24.96160
## ENSG00000101347 12703.3871       3.618232 0.1499441  24.13054
## ENSG00000120129  3409.0294       2.871326 0.1190334  24.12201
## ENSG00000189221  2341.7673       3.230629 0.1373644  23.51868
## ENSG00000211445 12285.6151       3.552999 0.1589971  22.34631
##                        pvalue          padj
##                     <numeric>     <numeric>
## ENSG00000152583 2.637881e-138 4.627108e-134
## ENSG00000165995 1.597973e-137 1.401503e-133
## ENSG00000101347 1.195378e-128 6.441180e-125
## ENSG00000120129 1.468829e-128 6.441180e-125
## ENSG00000189221 2.627083e-122 9.216332e-119
## ENSG00000211445 1.311440e-110 3.833996e-107

…and plot the counts for the top DEG.

plotCounts(dds, gene = which.min(res$padj), intgroup = "dex")

data <- plotCounts(dds, gene = which.min(res$padj), intgroup = c("dex", "cell"), returnData = TRUE) 
head(data)
count dex cell
SRR1039508 61.06772 untrt N61311
SRR1039509 2276.86214 trt N61311
SRR1039512 84.43486 untrt N052611
SRR1039513 1749.61320 trt N052611
SRR1039516 85.41333 untrt N080611
SRR1039517 1375.73220 trt N080611
library(ggplot2)

ggplot(data, aes(x = dex, y = count, col = cell)) +
  geom_point(position = position_jitter(width = .1, height = 0)) +
  scale_y_log10() +
  theme_bw()

ggplot(data, aes(x = dex, y = count, col = cell, group = cell)) +
  geom_point() +
  geom_line() +
  scale_y_log10() +
  theme_bw()

To display the counts for more than one gene we can use a heatmap.

library(pheatmap)
topgenes <- head(rownames(res.sorted_by_padj), 20)
print(topgenes)
##  [1] "ENSG00000152583" "ENSG00000165995" "ENSG00000101347"
##  [4] "ENSG00000120129" "ENSG00000189221" "ENSG00000211445"
##  [7] "ENSG00000157214" "ENSG00000162614" "ENSG00000125148"
## [10] "ENSG00000154734" "ENSG00000139132" "ENSG00000162493"
## [13] "ENSG00000162692" "ENSG00000179094" "ENSG00000134243"
## [16] "ENSG00000163884" "ENSG00000178695" "ENSG00000146250"
## [19] "ENSG00000198624" "ENSG00000148848"
# use rlog-transformed and depth-normalized counts instead of the raw counts for this kind of viz
mat <- rlog.norm.counts[topgenes, ]

# subtract the mean to generate a more uniform viz that is not driven by mean level of gene expression but rather displays how gene expression varies about the mean level (irrespective of what that is)
mat <- mat - rowMeans(mat)
df <- as.data.frame(colData(dds)[, c("dex", "cell")])
pheatmap(mat, annotation_col = df)

We can also add annotations to the results.

library(org.Hs.eg.db)
## Loading required package: DBI
keytypes(org.Hs.eg.db)
##  [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"     
##  [5] "ACCNUM"       "ALIAS"        "ENZYME"       "MAP"         
##  [9] "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [13] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
## [17] "GENENAME"     "UNIPROT"      "GO"           "EVIDENCE"    
## [21] "ONTOLOGY"     "GOALL"        "EVIDENCEALL"  "ONTOLOGYALL" 
## [25] "OMIM"         "UCSCKG"
head(rownames(dds))
## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
## [5] "ENSG00000000460" "ENSG00000000938"
anno <- select(org.Hs.eg.db, keys = topgenes, columns = c("SYMBOL", "GENENAME"), keytype = "ENSEMBL")
anno[match(topgenes, anno$ENSEMBL), ] # see also the mapIds function
ENSEMBL SYMBOL GENENAME
1 ENSG00000152583 SPARCL1 SPARC-like 1 (hevin)
2 ENSG00000165995 CACNB2 calcium channel, voltage-dependent, beta 2 subunit
3 ENSG00000101347 SAMHD1 SAM domain and HD domain 1
4 ENSG00000120129 DUSP1 dual specificity phosphatase 1
5 ENSG00000189221 MAOA monoamine oxidase A
6 ENSG00000211445 GPX3 glutathione peroxidase 3
7 ENSG00000157214 STEAP2 STEAP family member 2, metalloreductase
8 ENSG00000162614 NEXN nexilin (F actin binding protein)
9 ENSG00000125148 MT2A metallothionein 2A
10 ENSG00000154734 ADAMTS1 ADAM metallopeptidase with thrombospondin type 1 motif, 1
11 ENSG00000139132 FGD4 FYVE, RhoGEF and PH domain containing 4
12 ENSG00000162493 PDPN podoplanin
13 ENSG00000162692 VCAM1 vascular cell adhesion molecule 1
14 ENSG00000179094 PER1 period circadian clock 1
16 ENSG00000134243 SORT1 sortilin 1
17 ENSG00000163884 KLF15 Kruppel-like factor 15
18 ENSG00000178695 KCTD12 potassium channel tetramerization domain containing 12
19 ENSG00000146250 PRSS35 protease, serine, 35
20 ENSG00000198624 CCDC69 coiled-coil domain containing 69
21 ENSG00000148848 ADAM12 ADAM metallopeptidase domain 12

We can also re-analyze the data (i.e., build diff results tables) by changing the design as shown above or by specifying the contrast' argument of theresults` function, as shown below to look at the diff between two cell lines.

results(dds, contrast = c("cell", "N61311", "N052611"))
## log2 fold change (MAP): cell N61311 vs N052611 
## Wald test p-value: cell N61311 vs N052611 
## DataFrame with 64102 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE       stat    pvalue
##                 <numeric>      <numeric> <numeric>  <numeric> <numeric>
## ENSG00000000003 708.60217    -0.20680752 0.1368030 -1.5117175 0.1306057
## ENSG00000000005   0.00000             NA        NA         NA        NA
## ENSG00000000419 520.29790    -0.06389491 0.1490723 -0.4286169 0.6682020
## ENSG00000000457 237.16304     0.06428564 0.1827151  0.3518355 0.7249617
## ENSG00000000460  57.93263     0.34319847 0.2840573  1.2082017 0.2269697
## ...                   ...            ...       ...        ...       ...
## LRG_94                  0             NA        NA         NA        NA
## LRG_96                  0             NA        NA         NA        NA
## LRG_97                  0             NA        NA         NA        NA
## LRG_98                  0             NA        NA         NA        NA
## LRG_99                  0             NA        NA         NA        NA
##                      padj
##                 <numeric>
## ENSG00000000003 0.4342936
## ENSG00000000005        NA
## ENSG00000000419 0.8907923
## ENSG00000000457 0.9138413
## ENSG00000000460 0.5752896
## ...                   ...
## LRG_94                 NA
## LRG_96                 NA
## LRG_97                 NA
## LRG_98                 NA
## LRG_99                 NA
results(dds, contrast = c("dex", "trt", "untrt"))
## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 64102 rows and 6 columns
##                  baseMean log2FoldChange      lfcSE       stat
##                 <numeric>      <numeric>  <numeric>  <numeric>
## ENSG00000000003 708.60217    -0.37424998 0.09873107 -3.7906000
## ENSG00000000005   0.00000             NA         NA         NA
## ENSG00000000419 520.29790     0.20215551 0.10929899  1.8495642
## ENSG00000000457 237.16304     0.03624826 0.13684258  0.2648902
## ENSG00000000460  57.93263    -0.08523371 0.24654402 -0.3457140
## ...                   ...            ...        ...        ...
## LRG_94                  0             NA         NA         NA
## LRG_96                  0             NA         NA         NA
## LRG_97                  0             NA         NA         NA
## LRG_98                  0             NA         NA         NA
## LRG_99                  0             NA         NA         NA
##                       pvalue        padj
##                    <numeric>   <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005           NA          NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ...                      ...         ...
## LRG_94                    NA          NA
## LRG_96                    NA          NA
## LRG_97                    NA          NA
## LRG_98                    NA          NA
## LRG_99                    NA          NA
results(dds)
## log2 fold change (MAP): dex trt vs untrt 
## Wald test p-value: dex trt vs untrt 
## DataFrame with 64102 rows and 6 columns
##                  baseMean log2FoldChange      lfcSE       stat
##                 <numeric>      <numeric>  <numeric>  <numeric>
## ENSG00000000003 708.60217    -0.37424998 0.09873107 -3.7906000
## ENSG00000000005   0.00000             NA         NA         NA
## ENSG00000000419 520.29790     0.20215551 0.10929899  1.8495642
## ENSG00000000457 237.16304     0.03624826 0.13684258  0.2648902
## ENSG00000000460  57.93263    -0.08523371 0.24654402 -0.3457140
## ...                   ...            ...        ...        ...
## LRG_94                  0             NA         NA         NA
## LRG_96                  0             NA         NA         NA
## LRG_97                  0             NA         NA         NA
## LRG_98                  0             NA         NA         NA
## LRG_99                  0             NA         NA         NA
##                       pvalue        padj
##                    <numeric>   <numeric>
## ENSG00000000003 0.0001502838 0.001217611
## ENSG00000000005           NA          NA
## ENSG00000000419 0.0643763851 0.188306353
## ENSG00000000457 0.7910940556 0.907203245
## ENSG00000000460 0.7295576905 0.874422374
## ...                      ...         ...
## LRG_94                    NA          NA
## LRG_96                    NA          NA
## LRG_97                    NA          NA
## LRG_98                    NA          NA
## LRG_99                    NA          NA

In addition, we can use surrogate variable analysis (SVA) to investigate hidden structure due to, e.g., batch effects or other surrogate variables.

Because, in our example, we know that our dataset includes data from two diff cell lines, we could just use the DESeq function with a ~ cell + dex model design to look for dex/treatment-specific differences controlling for the cell line of origin. But suppose we were given this data without the cell line information. We could use SVA to try to identify this hidden structure.

library(sva)
## Loading required package: mgcv
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:Biostrings':
## 
##     collapse
## 
## The following object is masked from 'package:IRanges':
## 
##     collapse
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
idx <- rowMeans(counts(dds)) > 0
dat <- counts(dds)[idx, ]
mod <- model.matrix(~ dex, colData(dds)) # full design
mod0 <- model.matrix(~ 1, colData(dds)) # reduced design
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5
plot(svseq$sv[, 1], svseq$sv[, 2], col =  dds$cell, pch = 16)
legend("bottomright", levels(dds$cell), pch=16, col=1:4)

dds.sva <- dds
dds.sva$SV1 <- svseq$sv[, 1]
dds.sva$SV2 <- svseq$sv[, 2]
design(dds.sva) <- ~ SV1 + SV2 + dex
dds.sva <- DESeq(dds.sva)
## using pre-existing size factors
## estimating dispersions
## found already estimated dispersions, replacing these
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Differential expression analysis at the exon level using DEXSeq

The DEXSeq package offers differential testing of exon usage within each gene. We omit the python script calls for preparing the GFF file (from the GTF file) and the count tables (from the SAM/BAM files), but these can be found in the vignette along with additional information regarding the workflow outlined below.

library(pasilla)

dir <- system.file("extdata", package = "pasilla", mustWork = TRUE)
count_tables <- list.files(dir, pattern = "fb.txt$", full.names = TRUE)
gff_file <- list.files(dir, pattern = ".gff$", full.names = TRUE)
sample_table = data.frame(
  row.names = c( "treated1", "treated2", "treated3",
    "untreated1", "untreated2", "untreated3", "untreated4" ),
  condition = c("knockdown", "knockdown", "knockdown",
    "control", "control", "control", "control" ),
  libType = c( "single-end", "paired-end", "paired-end",
    "single-end", "single-end", "paired-end", "paired-end" ) )
print(sample_table)
##            condition    libType
## treated1   knockdown single-end
## treated2   knockdown paired-end
## treated3   knockdown paired-end
## untreated1   control single-end
## untreated2   control single-end
## untreated3   control paired-end
## untreated4   control paired-end
library(DEXSeq)
## Loading required package: BiocParallel
## 
## Attaching package: 'DEXSeq'
## 
## The following object is masked from 'package:Rsubread':
## 
##     featureCounts
dxd = DEXSeqDataSetFromHTSeq(count_tables,
  sampleData = sample_table,
  design = ~ sample + exon + condition:exon,
  flattenedfile = gff_file )
## converting counts to integer mode
# Subset to a much reduced number of genes (just for demontration, typically you would just run the analysis on the whole dataset)
geneids_in_subset <- read.table(file.path(dir, "geneIDsinsubset.txt"), as.is = TRUE)[[1]]
head(geneids_in_subset)
## [1] "FBgn0250871" "FBgn0015278" "FBgn0026777" "FBgn0051021" "FBgn0032979"
## [6] "FBgn0039914"
dxd <- dxd[geneIDs(dxd) %in% geneids_in_subset, ]

dxd <- estimateSizeFactors(dxd)
dxd <- estimateDispersions(dxd)
## using supplied model matrix
## using supplied model matrix
dxd <- testForDEU(dxd)
## using supplied model matrix
dxd <- estimateExonFoldChanges(dxd, fitExpToVar = "condition")

dxr <- DEXSeqResults(dxd)
plotMA(dxr, cex = 0.8)

plotDEXSeq(dxr, "FBgn0010909", legend = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)

plotDEXSeq(dxr, "FBgn0010909", displayTranscripts = TRUE, legend = TRUE, cex.axis = 1.2, cex = 1.3, lwd = 2)

Differential expression analysis at the isoform/transcript level using Cufflinks/Cummerbund

Here we show the exploratory plots offered by the cummeRbund package, which provides an R interface to Cufflinks output files. Additional information regarding the workflow outlined below can be found in the vignette.

First, we load in a directory containing the results from a Cufflinks analysis.

library(cummeRbund)
## Loading required package: reshape2
## Loading required package: fastcluster
## 
## Attaching package: 'fastcluster'
## 
## The following object is masked from 'package:stats':
## 
##     hclust
## 
## Loading required package: rtracklayer
## Loading required package: Gviz
## Loading required package: grid
## 
## Attaching package: 'cummeRbund'
## 
## The following object is masked from 'package:DESeq2':
## 
##     fpkm
## 
## The following objects are masked from 'package:GenomicFeatures':
## 
##     features, genes, promoters
## 
## The following object is masked from 'package:Biobase':
## 
##     samples
## 
## The following object is masked from 'package:GenomicRanges':
## 
##     promoters
## 
## The following object is masked from 'package:IRanges':
## 
##     promoters
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     conditions
## 
## The following object is masked from 'package:dplyr':
## 
##     count
dir <- system.file("extdata", package = "cummeRbund")
gtf_file <- system.file("extdata/chr1_snippet.gtf", package = "cummeRbund")
cuff <- readCufflinks(dir, gtfFile = gtf_file, genome = "hg19", rebuild = TRUE)
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped
## Warning: attributes are not identical across measure variables; they will
## be dropped

Boxplots of expression (FPKM) at the gene and isoform level:

csBoxplot(genes(cuff)) + theme_bw()

csBoxplot(genes(cuff), replicates = TRUE) + theme_bw()

csBoxplot(isoforms(cuff), replicates = TRUE) + theme_bw()

Scatterplot matrix of gene and isoform level expression:

csScatterMatrix(genes(cuff)) + theme_bw()

csScatterMatrix(isoforms(cuff)) + theme_bw()

Sample dendrograms using Jensen-Shannon distances:

csDendro(genes(cuff), replicates = TRUE) + theme_bw()

## NULL
csDendro(isoforms(cuff), replicates = TRUE) + theme_bw()

## NULL

MA-plot comparing two conditions:

MAplot(genes(cuff), "hESC", "Fibroblasts") + theme_bw()
## Warning: Removed 54 rows containing missing values (geom_point).

MAplot(isoforms(cuff), "hESC", "Fibroblasts") + theme_bw()
## Warning: Removed 187 rows containing missing values (geom_point).

A “volcano plot” matrix. Each volcano plot is the -log10(p-value) over the log fold change.

csVolcanoMatrix(genes(cuff)) + theme_bw()

csVolcanoMatrix(isoforms(cuff)) + theme_bw()

Other software that performs isoform-level abundance estimation:


https://support.bioconductor.org/p/69643

For 3d pca plot:

y <- cpm(dge, log=TRUE, prior.count=3)
pc <- princomp(y)
library(rgl) 
plot3d(pc$scores[,1:3])

Filtering out consistently non-expressed probe-sets by far the most common filtering step, because keeping probe-sets in your analysis that are never expressed is hardly ever useful. Apart from that, it is better not to filter unless you know what you’re doing. If you plan to use limma for the differential expression analysis, then filtering is not much needed, especially if you use trend=TRUE in the eBayes step. I personally prefer to keep it simple. Do some some simple filtering on mean log-expression, or don’t filter at all.


https://support.bioconductor.org/p/69467

There are many things you can do to increase power to detect DE genes in limma. Two things that will probably make a big difference are:

  1. Filter out probes that aren’t expressed in your experiment, of which there are likely to be many.

  2. Turn on robust eBayes estimation, because the rma() algorithm is returning identical normalized values for some of the probe sets.

For example,

design <- model.matrix(~factor(c(1,2,1,2)))
fit <- lmFit(eset, design) 
keep <- fit$Amean > median(fit$Amean)
fit <- eBayes(fit[keep,], robust=TRUE, trend=TRUE)
topTable(fit, coef=2)

Beware that the variance filtering methods recommended by the article linked to must only be used with ordinary t-tests. They should not be used with limma. In a limma pipeline, filtering by mean log-expression is simpler and better.

he Bourgon et al article itself says that variance filtering is not appropriate with limma, but unfortunately not very prominently so many people miss it. Personally, I think that variance filtering is problematic in many or most pipelines, not just with limma. I don’t think it would be a good idea with any expression platform that produces a decreasing mean-variance relationship, as most do. Nor is it appropropriate with any of the leading microarray statistical tests (limma, SAM, maanova, CyberT etc). To me, the Affymetrix-rma-ttest pipeline is one of the very few where variance filtering is beneficial, because it unusually produces an increasing mean-variance relationship at low intensities. Yet even here I would hope to get more benefit out of limma instead.

As far as what one should do, most people agree that filtering low expressed or non-expressed probes is a good idea – it makes intuitive sense from both statistical and biological points of view. I must admit I never thought that a journal article was necessary on this, but maybe I was wrong.

Don’t do variance filtering – then the problems you mention will all disappear. Variance filtering is hardly ever a good idea, and almost certainly not with RNA-seq data.

The limma algorithm analyses the spread of the genewise variances. Any filtering method based on genewise variances will change the distribution of variances, will interfere with the limma algorithm and hence will give poor results.

Like most people, I recommend filtering out genes that don’t appear to be expressed in any sample. See for example Case studies 15.3 or 15.4 in the limma User’s Guide.

However you will find if you use eBayes(fit,trend=TRUE) instead of the usual eBayes(fit) that limma gives pretty good results regardless how much filtering you do, provided of course that the filtering is on expression and not on variance.

The literature tends to say that the reason for filtering is to reduce the amount of multiple testing, but in truth the increase in power from this is only slight. The more important reason for filtering in most applications is to remove highly variable genes at low intensities. The importance of filtering is highly dependent on how you pre-processed your data. Filtering is less important if you (i) use a good background correction or normalising method that damps down variability at low intensities and (ii) use eBayes(trend=TRUE) which accommodates a mean-variance trend.

http://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf http://bioinf.wehi.edu.au/RNAseqCaseStudy/

Dear Gordon

The literature tends to say that the reason for filtering is to reduce the amount of multiple testing, but in truth the increase in power from this is only slight. The more important reason for filtering in most applications is to remove highly variable genes at low intensities. The importance of filtering is highly dependent on how you pre-processed your data. Filtering is less important if you (i) use a good background correction or normalising method that damps down variability at low intensities and (ii) use eBayes(trend=TRUE) which accommodates a mean-variance trend.

With all respect, I think this paragraph mixes up two separate issues and can benefit from clarification.

  1. While literature can probably be found to support any statement, the above-cited reason is indeed bogus when multiple testing is performed with an FDR objective. The paper by Bourgon et al. motivates filtering differently, namely by using a filter criterion that is independent of the test statistic under the null (thus does not affect type-I error; some subtlety is discussed in that paper) but dependent under the alternative (thus improves power).

  2. “Highly variable genes at low intensities” are indeed a problem of bad preprocessing and are better dealt with at that level, not by filtering. Nowadays, the commonly used methods for expression microarray or RNA-Seq analysis that I am aware of avoid that problem.

  3. The question when & how independent filtering (as in 1) is beneficial is quite unrelated to preprocessing. You are right that FDR is a property of the whole selected gene list, not of individual genes, and that different approaches exist for spending the type-I error budget wisely, by weighting different genes differently; of which independent filtering is one and trended eBayes (which is not the default option in limma) may be another.

Reference: Bourgon et al. PNAS 2010: http://www.pnas.org/content/107/21/9546


There are methods in the genefilter package that can be used to filter data, and you can also remove probesets based on present/absent calls (http://www.bioconductor.org/packages/release/bioc/html/panp.html).

library(oligo) library(limma) library(genefilter)

http://www.gettinggeneticsdone.com/2014/05/r-volcano-plots-to-visualize-rnaseq-microarray.html

https://support.bioconductor.org/p/65729

GTF tools